library(ISLR2)
lm.fit <- lm(mpg ~ horsepower, data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
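About 61% of the variance in mpg is explained by horsepower (R^2 = 0.6059), and the F-statistic (599.7, p-value < 2.2e-16) lets us reject the null hypothesis that there is no relationship.
Strength is not about the size of the slope, which depends on the units of horsepower. It is better judged by R^2 and the residual standard error: horsepower alone explains about 61% of the variance, with an RSE of 4.906 mpg, so the relationship is reasonably strong.
Negative - each additional unit of horsepower is associated with a drop of about 0.158 mpg.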
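To put a number on "strength", one option is to look at R^2 together with the RSE relative to the average mpg. The sketch below refits the same model under a new name (`fit`, my own choice) and pulls those quantities out of the fit. Given the RSE of 4.906 above, the relative error should come out at roughly a fifth of the mean mpg.

```r
library(ISLR2)

# Refit the simple regression from above
fit <- lm(mpg ~ horsepower, data = Auto)

r2      <- summary(fit)$r.squared   # proportion of variance in mpg explained
rse     <- sigma(fit)               # residual standard error, in mpg units
rel_err <- rse / mean(Auto$mpg)     # RSE as a fraction of the average mpg

round(c(r2 = r2, rse = rse, rel_err = rel_err), 3)
```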
predict(lm.fit, data.frame(horsepower=c(98)), interval = 'confidence')
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm.fit, data.frame(horsepower=c(98)), interval = 'prediction')
## fit lwr upr
## 1 24.46708 14.8094 34.12476
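As expected, the prediction interval is wider: it accounts for the irreducible error in an individual observation on top of the uncertainty in the estimated mean response.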
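The two intervals differ only by an extra "1 +" under the square root of the standard error. Here is a minimal sketch that recomputes both by hand for horsepower = 98, assuming the usual simple-regression formulas. All variable names are mine, not part of the exercise.

```r
library(ISLR2)

fit <- lm(mpg ~ horsepower, data = Auto)
x0  <- 98                                  # horsepower value to predict at

n     <- nrow(Auto)
xbar  <- mean(Auto$horsepower)
sxx   <- sum((Auto$horsepower - xbar)^2)   # spread of the predictor
s     <- sigma(fit)                        # residual standard error
tcrit <- qt(0.975, df = n - 2)             # 95% two-sided t quantile

yhat <- sum(coef(fit) * c(1, x0))          # fitted value at x0

# Standard error of the mean response vs. of a single new observation
se_mean <- s * sqrt(1 / n + (x0 - xbar)^2 / sxx)
se_pred <- s * sqrt(1 + 1 / n + (x0 - xbar)^2 / sxx)  # adds irreducible error

conf_int <- yhat + c(-1, 1) * tcrit * se_mean
pred_int <- yhat + c(-1, 1) * tcrit * se_pred
rbind(confidence = conf_int, prediction = pred_int)
```

Both rows should agree with the `predict()` output above.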
library(ggplot2)
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(size = 2, shape = 23) +
  geom_abline(intercept = coef(lm.fit)[1], slope = coef(lm.fit)[2], color = "red")
plot(lm.fit)
Potential issues:
- the residuals vs fitted plot shows a pattern rather than an even scatter around zero, which suggests a non-linear relationship
- the residuals vs leverage plot identifies some points that would impact the regression line if they were removed
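A small sketch of how these checks could be made a bit more concrete. It shows the four diagnostic plots in one panel, then regresses the residuals on a quadratic in horsepower as a rough numeric check for curvature. That second step is my own addition, not something the exercise asks for. If the linear fit were adequate, its R^2 should be close to zero.

```r
library(ISLR2)

fit <- lm(mpg ~ horsepower, data = Auto)

# Show the four standard lm diagnostic plots in one 2x2 panel
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Rough curvature check: how much of the leftover variation
# can a quadratic in horsepower still explain?
curv    <- lm(resid(fit) ~ poly(horsepower, 2), data = Auto)
curv_r2 <- summary(curv)$r.squared
curv_r2
```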
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data=Auto)
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
plot(lm.fit2)
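Two side notes on the quadratic fit, sketched under my own variable names. `poly()` uses orthogonal polynomials by default, which is why the coefficients above look so different from the linear model; passing `raw = TRUE` gives the same fitted values. An F-test via `anova()` is one way to check whether the quadratic term improves on the linear fit.

```r
library(ISLR2)

fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ poly(horsepower, 2), data = Auto)
fit_raw  <- lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto)

# Orthogonal and raw polynomials span the same space,
# so the fitted values should be identical
all.equal(fitted(fit_quad), fitted(fit_raw))

# F-test of nested models: does the quadratic term help?
cmp <- anova(fit_lin, fit_quad)
cmp
```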
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(subset(Auto, select=-c(name)), aes(colour=as.factor(origin)))
## Warning in cor(x, y): the standard deviation is zero
cor(subset(Auto, select=-c(name)))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.fit <- lm(mpg ~ ., data=subset(Auto, select=-c(name)))
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ ., data = subset(Auto, select = -c(name)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
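There’s a clear relationship between the predictors and the response: the F-statistic is 252.4 (p-value < 2.2e-16) and the model explains 82% of the variance (R^2 = 0.8215).
displacement, weight, year and origin are statistically significant at the 5% level; cylinders, horsepower and acceleration are not.
year is statistically significant given its p-value. Its coefficient of 0.75 suggests that, holding the other predictors fixed, mpg increases by about 0.75 per model year.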
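Rather than reading the significance stars by eye, the coefficient table can be filtered directly. A minimal sketch, with the 5% cutoff and the names `fit_all`, `coefs` and `sig` being my own choices:

```r
library(ISLR2)

fit_all <- lm(mpg ~ ., data = subset(Auto, select = -c(name)))

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit_all)$coefficients

# Keep only the rows with a p-value below 0.05
sig <- coefs[coefs[, "Pr(>|t|)"] < 0.05, ]
round(sig, 4)
```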
plot(lm.fit)
The Residuals vs Leverage plot indicates a few observations with very high leverage.
The Residuals vs Fitted plot shows a pattern rather than an even scatter around zero, indicating the true relationship is unlikely to be linear.
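To see which observations the leverage plot is pointing at, the hat values can be inspected directly. In this sketch the "3 times the average leverage" cutoff is only a rule of thumb I am assuming, not something taken from the exercise.

```r
library(ISLR2)

fit_all <- lm(mpg ~ ., data = subset(Auto, select = -c(name)))

h <- hatvalues(fit_all)        # leverage of each observation
p <- length(coef(fit_all))     # number of coefficients, including intercept
n <- nrow(Auto)
avg_lev <- p / n               # leverages always average to p / n

# Five largest leverages, then everything above the assumed cutoff
head(sort(h, decreasing = TRUE), 5)
high <- which(h > 3 * avg_lev)
high
```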
lm.fit_multi <- lm(mpg ~ .^2, data=subset(Auto, select=-c(name)))
summary(lm.fit_multi)
##
## Call:
## lm(formula = mpg ~ .^2, data = subset(Auto, select = -c(name)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
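Note the difference between * and : in R formulas: a * b expands to a + b + a:b (main effects plus interaction), whereas a:b is the interaction term only. Here .^2 includes every main effect and every pairwise interaction.
At the 5% level, the significant interactions are acceleration:origin, displacement:year and acceleration:year. Among the main effects, displacement, acceleration and origin are significant, though main effects are hard to interpret on their own when interactions are present.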
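The formula operators can be checked without fitting anything, by asking R which terms a formula expands to. A small sketch, where `term_labels` is a helper name of my own and `a`, `b`, `c`, `y` are placeholder variables:

```r
# List the terms a formula expands to, without fitting a model
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ a * b)          # main effects plus interaction
term_labels(y ~ a:b)            # interaction only
term_labels(y ~ (a + b + c)^2)  # all main effects and pairwise interactions

# With the 7 predictors used above, .^2 gives 7 main effects
# plus choose(7, 2) pairwise interactions
n_terms <- 7 + choose(7, 2)
n_terms
```

The total matches the 28 numerator degrees of freedom reported in the F-statistic above.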
lm.fit <- lm(mpg ~ weight + acceleration + displacement, data=subset(Auto, select=-c(name)))
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ weight + acceleration + displacement, data = subset(Auto,
## select = -c(name)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6583 -2.7805 -0.3571 2.4971 16.2067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.003203 1.864930 21.986 < 2e-16 ***
## weight -0.006174 0.000742 -8.320 1.51e-15 ***
## acceleration 0.186058 0.097970 1.899 0.0583 .
## displacement -0.010631 0.006524 -1.630 0.1040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.279 on 388 degrees of freedom
## Multiple R-squared: 0.7017, Adjusted R-squared: 0.6994
## F-statistic: 304.3 on 3 and 388 DF, p-value: < 2.2e-16
lm.fit1 <- lm(mpg ~ weight + acceleration + I(log(displacement)), data=subset(Auto, select=-c(name)))
summary(lm.fit1)
##
## Call:
## lm(formula = mpg ~ weight + acceleration + I(log(displacement)),
## data = subset(Auto, select = -c(name)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5479 -2.6642 -0.3638 2.3460 16.8464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.4565452 4.9950040 12.304 < 2e-16 ***
## weight -0.0043803 0.0007195 -6.088 2.75e-09 ***
## acceleration 0.1302337 0.0896918 1.452 0.147
## I(log(displacement)) -5.2637315 1.2019343 -4.379 1.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.191 on 388 degrees of freedom
## Multiple R-squared: 0.7138, Adjusted R-squared: 0.7116
## F-statistic: 322.6 on 3 and 388 DF, p-value: < 2.2e-16
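Using the log of displacement increases R^2 slightly (0.7138 vs 0.7017), and log(displacement) is significant (p-value 1.53e-05) where raw displacement was not (p-value 0.104). acceleration is not significant at the 5% level in either model (p-values 0.0583 and 0.147).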
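To compare several transformations of displacement side by side, the fits can be run in a loop. This is only a sketch: the sqrt and squared variants are my additions and were not fitted above, and the names `dat`, `forms` and `r2` are my own.

```r
library(ISLR2)

dat <- subset(Auto, select = -c(name))

# One formula per transformation of displacement
forms <- list(
  raw    = mpg ~ weight + acceleration + displacement,
  log    = mpg ~ weight + acceleration + log(displacement),
  sqrt   = mpg ~ weight + acceleration + sqrt(displacement),
  square = mpg ~ weight + acceleration + I(displacement^2)
)

# Fit each model and collect its R^2
r2 <- sapply(forms, function(f) summary(lm(f, data = dat))$r.squared)
round(r2, 4)
```

The `raw` and `log` entries should reproduce the 0.7017 and 0.7138 reported above.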